[Enhancement] Use weighted ranking to cap refinement candidates (CF-931) by mohammedahmed18 · Pull Request #962 · codeflash-ai/codeflash

mohammedahmed18 · 2025-12-10T14:51:07Z

PR Type

Enhancement

Description

Rank refinements by runtime and diff
Add normalization and weighting utilities
Change refinement request payload to ints
Cap refinements to the top 45% of candidates

Diagram Walkthrough

flowchart LR
  A["Valid optimizations"] -- "compute diff, runtime" --> B["Normalize metrics"]
  B["Normalize metrics"] -- "apply weights (runtime,diff)" --> C["Score candidates"]
  C["Score candidates"] -- "select top 45% (all if count <= 2)" --> D["Submit refinements"]
  D["Submit refinements"] -- "parallel futures" --> E["Queue refined candidates"]
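The scoring and selection steps in the diagram can be sketched in plain Python. This is a minimal sketch, not the PR's code: `select_refinement_indices` and `max_normalize` are hypothetical names, the sketch divides by the maximum (the "min = 0" scheme the reviewers settle on later in this thread) rather than using min-max scaling, and it breaks score ties by index. The constants mirror the values quoted in the `config_consts.py` review comment.

```python
REFINE_ALL_THRESHOLD = 2                    # refine everything at or below this count
REFINED_CANDIDATE_RANKING_WEIGHTS = (2, 1)  # (runtime, diff): runtime counts twice as much
TOP_N_REFINEMENTS = 0.45                    # fraction of candidates to refine


def max_normalize(values: list[float]) -> list[float]:
    """Scale non-negative values into [0, 1] by dividing by the maximum."""
    peak = max(values, default=0)
    if peak <= 0:
        return [0.0 for _ in values]
    return [v / peak for v in values]


def select_refinement_indices(runtimes: list[float], diff_lens: list[float]) -> list[int]:
    """Return indices of the candidates to refine, best (lowest score) first."""
    if len(runtimes) != len(diff_lens):
        raise ValueError("runtimes and diff_lens must be aligned by index")
    n = len(runtimes)
    if n <= REFINE_ALL_THRESHOLD:
        return list(range(n))

    # Turn relative importances into weights that sum to 1.
    runtime_w, diff_w = REFINED_CANDIDATE_RANKING_WEIGHTS
    total = runtime_w + diff_w
    weights = (runtime_w / total, diff_w / total)

    runtime_norm = max_normalize(runtimes)
    diff_norm = max_normalize(diff_lens)
    # Lower runtime and smaller diffs are both better, so the lowest score wins.
    scores = [weights[0] * r + weights[1] * d for r, d in zip(runtime_norm, diff_norm)]

    top_n = int(TOP_N_REFINEMENTS * n + 0.5)  # round half up
    return sorted(range(n), key=lambda i: (scores[i], i))[:top_n]


print(select_refinement_indices([120.0, 80.0, 100.0, 95.0, 300.0], [40, 10, 5, 60, 8]))  # [1, 2]
```

With five candidates, `int(0.45 * 5 + 0.5)` is 2, so only the two lowest weighted scores are submitted for refinement.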

File Walkthrough

Relevant files

Enhancement

aiservice.py (codeflash/api/aiservice.py, +2/-6): Humanize runtime fields; trim logs. Sends runtimes as humanized strings and removes extra debug/console logging.
code_utils.py (codeflash/code_utils/code_utils.py, +57/-0): Utilities for weighting and normalization. Adds a choose_weights helper, a normalize utility, and a weighted score dictionary builder.
models.py (codeflash/models/models.py, +2/-2): Refiner request runtime type to int. Changes the refiner request runtimes to int.
function_optimizer.py (codeflash/optimization/function_optimizer.py, +66/-50): Weighted, selective, parallel refinement flow. Queue-based refinement selection by weighted score; builds refiner requests locally with int runtimes; submits refinements selectively and in parallel; injects the AIS client/executor into the processor.

Configuration changes

config_consts.py (codeflash/code_utils/config_consts.py, +5/-0): Config for weighted refinement capping. Adds refinement thresholds and weights; introduces the top-N refinement ratio.
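The "submit refinements selectively and in parallel" step can be pictured with the standard library's `concurrent.futures`. This is only a sketch: `submit_refinements` and the `refine` callable are hypothetical stand-ins for the AI-service refinement request, whereas the PR injects its own AIS client and executor.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from queue import Queue
from typing import Callable


def submit_refinements(
    candidates: list[dict],
    selected: list[int],
    refine: Callable[[dict], list[str]],
    max_workers: int = 4,
) -> Queue:
    """Refine only the selected candidates concurrently and queue the results.

    `refine` is a hypothetical stand-in for one refinement request: it takes a
    candidate payload and returns zero or more refined code strings.
    """
    refined: Queue = Queue()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # One future per selected candidate; unselected candidates are never sent.
        futures = [executor.submit(refine, candidates[idx]) for idx in selected]
        for future in as_completed(futures):
            for code in future.result():  # re-raises if that request failed
                refined.put(code)
    return refined


queue = submit_refinements(
    [{"code": "a"}, {"code": "b"}, {"code": "c"}],
    selected=[0, 2],
    refine=lambda c: [c["code"] + "  # refined"],
)
print(queue.qsize())  # 2
```

Results arrive in completion order, not selection order, so anything downstream that cares about rank should carry the index along with the refined code.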

github-actions · 2025-12-10T14:52:08Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue
Variable name typo: 'top_indecies' appears to be a misspelling of 'top_indices'. While functional, this reduces readability and may cause confusion; consider renaming.

    top_indecies = sorted(score_dict, key=score_dict.get)[:top_n_candidates]
    for idx in top_indecies:
        data = self.all_refinements_data[idx]

Type Mismatch
Function 'create_score_dictionary_from_metrics' declares return type dict[int, int] but builds a dict of floats; adjust the return type to dict[int, float] or cast appropriately.

    def create_score_dictionary_from_metrics(weights: list[float], *metrics: list[float]) -> dict[int, int]:
        """Combine multiple metrics into a single weighted score dictionary.

        Each metric is a list of values (smaller = better).
        The total score for each index is the weighted sum of its values
        across all metrics:

            score[index] = Σ (value * weight)

        Args:
            weights: A list of weights, one per metric. Larger weight = more influence.
            metrics: Lists of values (one list per metric, aligned by index).

        Returns:
            A dictionary mapping each index to its combined weighted score.

        """
        if len(weights) != len(metrics):
            raise ValueError("Number of weights must match number of metrics")

        combined: dict[int, float] = {}

        for weight, metric in zip(weights, metrics):
            for idx, value in enumerate(metric):
                combined[idx] = combined.get(idx, 0) + value * weight

        return combined

Edge Case Handling
'normalize' returns all zeros when values are constant; the ranking then ties, and slicing the top N may be arbitrary. Consider deterministic tie-breaking, or ensure at least one candidate is selected when TOP_N_REFINEMENTS rounds to 0 for small lists above REFINE_ALL_THRESHOLD.

    runtime_w, diff_w = REFINED_CANDIDATE_RANKING_WEIGHTS
    weights = choose_weights(runtime=runtime_w, diff=diff_w)

    runtime_norm = normalize(runtimes_list)
    diffs_norm = normalize(diff_lens_list)
    # the lower the better
    score_dict = create_score_dictionary_from_metrics(weights, runtime_norm, diffs_norm)

    top_n_candidates = int((TOP_N_REFINEMENTS * len(runtimes_list)) + 0.5)
    top_indecies = sorted(score_dict, key=score_dict.get)[:top_n_candidates]
    for idx in top_indecies:
        data = self.all_refinements_data[idx]
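The tie case flagged under Edge Case Handling is easy to reproduce. A minimal sketch, assuming `normalize` is plain min-max scaling that maps a constant list to zeros (as the guide describes); `min_max_normalize` and `top_k_indices` are illustrative names, not the PR's code. Sorting on `(score, index)` states the tie-break explicitly.

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Scale values into [0, 1]; a constant list maps to all zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def top_k_indices(score_dict: dict[int, float], k: int) -> list[int]:
    """Lowest score first; ties go to the lower index, stated explicitly."""
    return sorted(score_dict, key=lambda idx: (score_dict[idx], idx))[:k]


# A constant metric collapses to zeros, so every candidate ties.
scores = dict(enumerate(min_max_normalize([50.0, 50.0, 50.0, 50.0])))
print(scores)                    # {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
print(top_k_indices(scores, 2))  # [0, 1]
```

Because Python's `sorted` is stable and dicts keep insertion order, `sorted(score_dict, key=score_dict.get)` already resolves ties toward lower indices when the dict is filled in index order; the explicit key makes that behavior visible and independent of how the dict was built.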

github-actions · 2025-12-10T14:52:37Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue: Fix score dict typing and validation (Impact: Medium)
The return type is declared as `dict[int, int]` but the values are floats from weighted sums, which can cause type misuse. Also validate that all metric lists have equal length to prevent silent index skew. Update the annotation to `dict[int, float]` and add a uniform-length check.

codeflash/code_utils/code_utils.py [72-98]

    -def create_score_dictionary_from_metrics(weights: list[float], *metrics: list[float]) -> dict[int, int]:
    -    """Combine multiple metrics into a single weighted score dictionary.
    -
    -    Each metric is a list of values (smaller = better).
    -    The total score for each index is the weighted sum of its values
    -    across all metrics:
    -
    -        score[index] = Σ (value * weight)
    -
    -    Args:
    -        weights: A list of weights, one per metric. Larger weight = more influence.
    -        metrics: Lists of values (one list per metric, aligned by index).
    -
    -    Returns:
    -        A dictionary mapping each index to its combined weighted score.
    -
    -    """
    +def create_score_dictionary_from_metrics(weights: list[float], *metrics: list[float]) -> dict[int, float]:
    +    """Combine multiple metrics into a single weighted score dictionary."""
         if len(weights) != len(metrics):
             raise ValueError("Number of weights must match number of metrics")
    +    lengths = {len(m) for m in metrics}
    +    if len(lengths) > 1:
    +        raise ValueError("All metric lists must have the same length")
         combined: dict[int, float] = {}
    -
         for weight, metric in zip(weights, metrics):
             for idx, value in enumerate(metric):
    -            combined[idx] = combined.get(idx, 0) + value * weight
    -
    +            combined[idx] = combined.get(idx, 0.0) + value * weight
         return combined

Suggestion importance[1-10]: 7
Why: Correctly identifies a typing mismatch (`dict[int, int]` vs float values) and adds a sensible length validation; both improve correctness and maintainability without altering logic.

Possible issue: Ensure at least one refinement (Impact: Low)
When `TOP_N_REFINEMENTS` yields 0 (e.g., small lists or a low fraction), no candidates are refined, stalling refinement. Ensure at least one candidate is selected when there are items, and correct the typo in `top_indecies` to avoid confusion. Clamp the count to `[1, len(...)]`.

codeflash/optimization/function_optimizer.py [205-211]

     top_n_candidates = int((TOP_N_REFINEMENTS * len(runtimes_list)) + 0.5)
    -top_indecies = sorted(score_dict, key=score_dict.get)[:top_n_candidates]
    +if len(runtimes_list) > 0:
    +    top_n_candidates = max(1, min(top_n_candidates, len(runtimes_list)))
    +top_indices = sorted(score_dict, key=score_dict.get)[:top_n_candidates]
     
    -for idx in top_indecies:
    +for idx in top_indices:
         data = self.all_refinements_data[idx]
         future_refinements.append(self.refine_optimizations([data]))

Suggestion importance[1-10]: 6
Why: Clamping the number of selected candidates avoids a potential no-op when rounding yields 0 and fixes a typo; useful but not critical, since a low TOP_N_REFINEMENTS might be intentional.

General: Guard empty metric lists (Impact: Low)
Lower normalized values are better, because a higher runtime or diff length indicates a worse candidate, so this line is fine as written; however, confirm the normalization semantics by documenting them explicitly, and guard against negative or empty lists. Add an early return when the lists are empty to avoid downstream errors.

codeflash/optimization/function_optimizer.py [201-206]

    -runtime_norm = normalize(runtimes_list)
    -diffs_norm = normalize(diff_lens_list)
    -# the lower the better
    +if not runtimes_list or not diff_lens_list:
    +    self.refinement_done = True
    +    return self.get_next_candidate()
    +
    +runtime_norm = normalize(runtimes_list)  # lower is better
    +diffs_norm = normalize(diff_lens_list)  # lower is better
     score_dict = create_score_dictionary_from_metrics(weights, runtime_norm, diffs_norm)

Suggestion importance[1-10]: 5
Why: Adding an explicit empty-list guard improves robustness and aligns with later use; moderate impact, as upstream logic likely ensures non-empty inputs, but the check is harmless and clarifies intent.
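Taken together, the three suggestions amount to something like the following. This is a sketch under assumptions, not the merged code: it keeps the variadic `*metrics` signature implied by the call `create_score_dictionary_from_metrics(weights, runtime_norm, diffs_norm)`, and `pick_top_indices` is a hypothetical helper that folds in the clamp and the empty-list guard.

```python
def create_score_dictionary_from_metrics(weights: list[float], *metrics: list[float]) -> dict[int, float]:
    """Weighted sum per index across aligned metric lists (lower is better)."""
    if len(weights) != len(metrics):
        raise ValueError("Number of weights must match number of metrics")
    if len({len(m) for m in metrics}) > 1:
        raise ValueError("All metric lists must have the same length")
    combined: dict[int, float] = {}
    for weight, metric in zip(weights, metrics):
        for idx, value in enumerate(metric):
            combined[idx] = combined.get(idx, 0.0) + value * weight
    return combined


def pick_top_indices(score_dict: dict[int, float], fraction: float) -> list[int]:
    """Select round-half-up(fraction * n) indices, clamped to [1, n]."""
    n = len(score_dict)
    if n == 0:
        return []  # empty guard: nothing to refine
    top_n = max(1, min(int(fraction * n + 0.5), n))
    return sorted(score_dict, key=score_dict.get)[:top_n]


# Hypothetical already-normalized metrics for three candidates.
scores = create_score_dictionary_from_metrics([2 / 3, 1 / 3], [0.2, 1.0, 0.5], [1.0, 0.0, 0.5])
print(pick_top_indices(scores, 0.45))  # [0]
print(pick_top_indices(scores, 0.1))   # [0]  (0.3 rounds to 0, clamped up to 1)
```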

codeflash-ai · 2025-12-10T14:56:44Z

⚡️ Codeflash found optimizations for this PR

📄 115% (1.15x) speedup for `AiServiceClient.optimize_python_code_refinement` in `codeflash/api/aiservice.py`

⏱️ Runtime : 32.8 milliseconds → 15.3 milliseconds (best of 65 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up method AiServiceClient.optimize_python_code_refinement by 115% in PR #962 (limit-refined-candidates) #963

If you approve, it will be merged into this PR (branch limit-refined-candidates).

KRRT7

I really like this; just implement the changes that the PR review bot suggested.


KRRT7 · 2025-12-12T16:36:47Z

I suspect this will help with the long runtimes of our tracer-replay as well, now that I think of it.

misrasaurabh1 · 2025-12-16T02:10:31Z

+# Refinement
+REFINE_ALL_THRESHOLD = 2  # when valid optimizations count is 2 or less, refine all optimizations
+REFINED_CANDIDATE_RANKING_WEIGHTS = (2, 1)  # (runtime, diff), runtime is more important than diff by a factor of 2
+TOP_N_REFINEMENTS = 0.45  # top 45% of valid optimizations (based on the weighted score) are refined
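Concretely, the three constants interact like this: pools at or below the threshold are refined in full, and larger pools get the fraction applied with round-half-up. A minimal sketch; `refinement_count` is a hypothetical helper, and the rounding expression is the one from the `top_n_candidates` line quoted in the reviewer guide.

```python
REFINE_ALL_THRESHOLD = 2
TOP_N_REFINEMENTS = 0.45


def refinement_count(num_valid: int) -> int:
    """How many candidates would be refined under the constants above."""
    if num_valid <= REFINE_ALL_THRESHOLD:
        return num_valid  # small pools: refine everything
    # int(x + 0.5) rounds halves up for positive x; round() would use banker's rounding.
    return int(TOP_N_REFINEMENTS * num_valid + 0.5)


for n in range(1, 11):
    print(n, refinement_count(n))
```

Under these values a pool of 3 yields only 1 refinement while a pool of 2 yields 2, which is worth keeping in mind if the ratio is replaced by a fixed count such as the 3 floated in the reply below.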


Any reason for this number?

Nothing in particular. I was thinking of making it a fixed number, maybe 3?
@misrasaurabh1 @KRRT7 @aseembits93

misrasaurabh1 · 2025-12-16T02:12:37Z

+    return [v / total for v in importance.values()]
+
+
+def normalize(values: list[float]) -> list[float]:


Can you rename this function to `min_max_normalize`? `normalize` is too broad.
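Only the last line of `choose_weights` appears in this thread. A plausible reconstruction, assuming it takes keyword importances (as in the call `choose_weights(runtime=runtime_w, diff=diff_w)`) and scales them to sum to 1; the guard against a non-positive total is an addition, not something shown in the PR.

```python
def choose_weights(**importance: float) -> list[float]:
    """Turn relative importances into weights that sum to 1, in keyword order."""
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive number")
    return [v / total for v in importance.values()]


weights = choose_weights(runtime=2, diff=1)  # runtime gets 2/3, diff gets 1/3
```

Keyword arguments keep their call order (PEP 468), so the returned list lines up with the order in which the metrics are later passed to the scorer.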

misrasaurabh1 · 2025-12-16T02:24:13Z

+            weights = choose_weights(runtime=runtime_w, diff=diff_w)
+
+            runtime_norm = normalize(runtimes_list)
+            diffs_norm = normalize(diff_lens_list)


I am wondering whether min-max normalization is a good idea for these.
With it, every candidate with the minimal runtime or diff_len gets a normalized value of 0, and every maximal one gets a value of 1. That holds even if the difference between the min and the max is minuscule.

The problem I see is that min-max normalization gets rid of the relative scale of the runtimes and diff lengths.

Instead of normalizing with min = the minimal data point, why not try min = 0? A diff length or runtime can only ever be as small as 0, and with this formulation we can think of the values as a vector emanating from the origin: the largest data point gets a value of 1, and every other point gets a value between 0 and 1 proportional to its magnitude relative to the max. So if a runtime is half of the max, a score of 0.5 sounds more reasonable than 0.

This preserves a sense of scale.
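The difference between the two schemes shows up with a couple of made-up runtimes. A minimal sketch; `max_normalize` is a hypothetical name for the "min = 0" scheme proposed here (the follow-up commit is titled "normalize by max", but its code is not shown in this thread).

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Map min -> 0 and max -> 1; the relative scale between values is lost."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def max_normalize(values: list[float]) -> list[float]:
    """Treat 0 as the floor: divide by the max so ratios between values survive."""
    hi = max(values)
    if hi <= 0:
        return [0.0] * len(values)
    return [v / hi for v in values]


nearly_equal = [100.0, 101.0]
print(min_max_normalize(nearly_equal))  # [0.0, 1.0], a 1% gap becomes the full range
print(max_normalize(nearly_equal))      # both values stay close to 1.0

half = [50.0, 100.0]
print(min_max_normalize(half))  # [0.0, 1.0]
print(max_normalize(half))      # [0.5, 1.0]
```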

Nice, this is definitely more accurate.

misrasaurabh1 · 2025-12-19T02:19:57Z

@mohammedahmed18 let's merge this. You should keep an eye out to see whether the filtering still makes sense with the new ranking logic.

limit the refined candidates based on the weighted ranking

e895c52

mohammedahmed18 requested review from KRRT7, aseembits93 and misrasaurabh1 and removed request for KRRT7 and aseembits93 December 10, 2025 14:51

github-actions Bot added the Review effort 3/5 label Dec 10, 2025

codeflash-ai Bot mentioned this pull request Dec 10, 2025

⚡️ Speed up method AiServiceClient.optimize_python_code_refinement by 115% in PR #962 (limit-refined-candidates) #963

Closed

KRRT7 suggested changes Dec 12, 2025


Merge branch 'main' of github.com:codeflash-ai/codeflash into limit-refined-candidates

ac2a3a1

misrasaurabh1 reviewed Dec 16, 2025


mohammedahmed18 and others added 2 commits December 16, 2025 20:54

normalize by max

bc735a3

Merge branch 'main' into limit-refined-candidates

89d1f0e

misrasaurabh1 approved these changes Dec 19, 2025


Merge branch 'main' into limit-refined-candidates

0a5649f

KRRT7 approved these changes Dec 19, 2025


mohammedahmed18 merged commit 872ec28 into main Dec 19, 2025
22 checks passed

mohammedahmed18 deleted the limit-refined-candidates branch December 19, 2025 02:55

mohammedahmed18 merged 5 commits into main from limit-refined-candidates


3 participants
